Model Report for census:tabular

Generated on 24 Feb 2025, 16:39   ●   3,374 original samples, 3,374 synthetic samples

Accuracy 89.8% (94.8%)
Univariate 94.1% (97.4%)
Bivariate 85.4% (92.3%)
Similarity
Cosine Similarity 0.99814 (0.99784)
Discriminator AUC 73.0% (57.3%)
Distances
Identical Matches 0.0% (0.0%)
Average Distances 0.335 (0.344)

Correlations

Univariate Distributions

Bivariate Distributions

Accuracy

Column Univariate Bivariate
average_opaque_surface_transmittance 97.2% 87.2%
total_opaque_surface 97.0% 82.8%
surface_to_volume_ratio 96.9% 86.9%
total_glazed_surface 96.7% 84.4%
EPt 96.6% 89.3%
floors 96.5% 89.9%
net_area 96.3% 79.0%
yie 96.2% 87.8%
average_glazed_surface_transmittance 96.0% 87.2%
EPl 95.9% 88.1%
Cm 95.9% 83.9%
energy_vectors_used 95.4% 90.4%
dispersing_surface 95.4% 79.6%
total_effective_ventilation_flow 95.4% 87.2%
system_type 95.3% 89.5%
EPc 95.2% 89.2%
EPh 95.1% 86.1%
heated_gross_volume 95.1% 81.2%
Asol 94.8% 84.7%
installation_year 94.7% 88.0%
heated_usable_area 94.3% 80.4%
QHimp 94.0% 83.9%
air_changes 93.9% 87.0%
nominal_power 93.1% 84.1%
cooled_gross_volume 93.0% 84.1%
QHnd 92.7% 82.7%
EPv 92.5% 88.4%
degree_days 92.5% 86.2%
EPw 92.5% 87.0%
EPgl 92.4% 85.1%
DPR412_classification 91.7% 86.3%
construction_year 91.4% 85.6%
ventilation_type 91.2% 88.5%
cooled_usable_area 77.7% 73.5%
Total 94.1% 85.4%

Explainer
Accuracy of synthetic data is assessed by comparing the distributions of the synthetic data (shown in green) with those of the original data (shown in gray). For each distribution plot, the deviations are summed across all categories to obtain the total variation distance (TVD). The reported accuracy is then 100% - TVD. These accuracies are calculated for all univariate and bivariate distributions, and the final accuracy score is the average across them. In this report, the overall accuracy of 89.8% matches the mean of the univariate (94.1%) and bivariate (85.4%) accuracies.
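As a minimal sketch of the accuracy calculation described above, the snippet below computes 1 - TVD for a single categorical column. It assumes the standard definition of TVD (half the sum of absolute frequency deviations), which the explainer does not spell out, and it would require numeric columns to be binned first. The function name `tvd_accuracy` and the toy data are hypothetical and are not taken from the report or its tooling.

```python
from collections import Counter


def tvd_accuracy(original, synthetic):
    """Return 1 - TVD between the category frequencies of two samples."""
    orig_counts = Counter(original)
    syn_counts = Counter(synthetic)
    categories = set(orig_counts) | set(syn_counts)
    # Sum the absolute differences in relative frequency per category.
    # Counter returns 0 for categories missing from one of the samples.
    deviations = sum(
        abs(orig_counts[c] / len(original) - syn_counts[c] / len(synthetic))
        for c in categories
    )
    # Assumption: standard TVD, i.e. half the summed absolute deviations,
    # so the result always lies between 0 and 1.
    tvd = 0.5 * deviations
    return 1.0 - tvd


# Hypothetical toy column, not data from this report.
original = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
synthetic = ["A"] * 45 + ["B"] * 35 + ["C"] * 20
accuracy = tvd_accuracy(original, synthetic)
print(f"Univariate accuracy: {accuracy:.1%}")
```

For a bivariate accuracy, the same function could be applied to pairs of values, for example `list(zip(column_a, column_b))`, so that each combination of categories is treated as one category.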

Similarity


Explainer
These plots show the first three principal components of the training samples, the synthetic samples, and (if available) the holdout samples within the embedding space. The black dots mark the centroid of each set of samples. The similarity metric is the cosine similarity between these centroids. We expect the cosine similarity to be close to 1, indicating that the synthetic samples are as similar to the training samples as the holdout samples are.
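The centroid comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the report's implementation: the embeddings are hypothetical 3-dimensional vectors, and the helper names `centroid` and `cosine_similarity` are introduced here for the example only.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, not data from this report.
training_embeddings = [[1.0, 0.0, 1.0], [0.8, 0.2, 1.2]]
synthetic_embeddings = [[0.9, 0.1, 1.0], [1.0, 0.1, 1.1]]

similarity = cosine_similarity(
    centroid(training_embeddings), centroid(synthetic_embeddings)
)
print(f"Cosine similarity of centroids: {similarity:.5f}")
```

Because only the centroids are compared, this score summarizes where each set of samples sits on average in the embedding space and says nothing about the spread of individual samples.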

Distances

Synthetic vs. Training Data (Synthetic vs. Holdout Data)
Identical Matches 0.0% (0.0%)
Average Distances 0.335 (0.344)


Explainer
Synthetic data should be as close to the original training samples as it is to the original holdout samples, which serve as a reference. This can be checked empirically by measuring the distance from each synthetic sample to its closest original sample, with the training and holdout sets sampled to be of equal size. In the visualization above, the distances of synthetic samples to the training samples are displayed in green, and the distances of synthetic samples to the holdout samples (if available) are displayed in gray. A green line that lies significantly to the left of the gray line implies that synthetic samples are closer to the training samples than to the holdout samples, indicating that the model has overfitted to the training data. A green line that overlaps the gray line confirms that the trained model captures general patterns that are found in the training samples just as well as in the holdout samples.
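The distance check described above can be sketched as below. The sketch assumes Euclidean distance on already numeric records, which the explainer does not specify, and uses a brute-force nearest-neighbor search that is only practical for small samples. The helper names `nearest_distances` and `summarize` and all data points are hypothetical.

```python
import math


def nearest_distances(synthetic, reference):
    """Distance from each synthetic sample to its closest reference sample."""
    # Assumption: Euclidean distance; the report does not name its metric.
    return [min(math.dist(s, r) for r in reference) for s in synthetic]


def summarize(distances):
    """Return (share of identical matches, average distance)."""
    identical_share = sum(d == 0.0 for d in distances) / len(distances)
    average = sum(distances) / len(distances)
    return identical_share, average


# Hypothetical records; training and holdout sets are of equal size.
training = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
holdout = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
synthetic = [[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]]

trn_identical, trn_average = summarize(nearest_distances(synthetic, training))
hol_identical, hol_average = summarize(nearest_distances(synthetic, holdout))
print(f"Identical Matches {trn_identical:.1%} ({hol_identical:.1%})")
print(f"Average Distances {trn_average:.3f} ({hol_average:.3f})")
```

In this toy data the synthetic records sit noticeably closer to the training set than to the holdout set and one record is an exact copy, which is the pattern the explainer flags as overfitting. In the report itself the two averages are nearly equal (0.335 vs. 0.344) and there are no identical matches.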